{
  "cells": [
    {
      "metadata": {
        "id": "Lkdbw7M9wAzu"
      },
      "cell_type": "markdown",
      "source": [
        "# Tokenizer\n",
        "\n",
        "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google-deepmind/gemma/blob/main/colabs/tokenizer.ipynb)\n",
        "\n",
        "This tutorial show how to use the Gemma tokenizer. Understanding tokenizer is  important to correctly feed input to the model.\n",
        "\n",
        "For more info on tokenizer, see the excelent talk from [Andrej Karpathy](https://www.youtube.com/watch?v=zduSFxRajkE)."
      ]
    },
    {
      "metadata": {
        "id": "9owh3OdyKVVv"
      },
      "cell_type": "code",
      "source": [
        "!pip install -q gemma"
      ],
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "qdVsCTszv-ZV"
      },
      "cell_type": "code",
      "source": [
        "# Common imports\n",
        "\n",
        "# Gemma imports\n",
        "from gemma import gm"
      ],
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "JeF4D1rMD1dT"
      },
      "cell_type": "markdown",
      "source": [
        "## Tokenizer basics\n",
        "\n",
        "Gemma tokenizers are directly available:"
      ]
    },
    {
      "metadata": {
        "id": "pYWcRKTID8pl"
      },
      "cell_type": "code",
      "source": [
        "tokenizer = gm.text.Gemma3Tokenizer()"
      ],
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "cVjuis7vUoZP"
      },
      "cell_type": "markdown",
      "source": [
        "The total number of tokens is available through `.vocab_size`:"
      ]
    },
    {
      "metadata": {
        "executionInfo": {
          "elapsed": 53,
          "status": "ok",
          "timestamp": 1736782833687,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "6dOwfsT9UnjW",
        "outputId": "8cab1b0c-b181-4e5d-aa73-089e9d6dec5a"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.vocab_size"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "256000"
            ]
          },
          "execution_count": 71,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "BlbB-DeiHYlJ"
      },
      "cell_type": "markdown",
      "source": [
        "### Encoding\n",
        "\n",
        "You can encode a string:\n",
        "\n",
        "* Into token ids with `.encode`:"
      ]
    },
    {
      "metadata": {
        "colab": {
          "height": 34
        },
        "executionInfo": {
          "elapsed": 54,
          "status": "ok",
          "timestamp": 1736779492085,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "jWsEhGCOH7nW",
        "outputId": "4353f06a-1279-4ccc-8b7f-cf583e57df33"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.encode('Derinkuyu is an underground city.')"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[8636, 979, 78904, 603, 671, 30073, 3413, 235265]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "K4HGqQfQH_f2"
      },
      "cell_type": "markdown",
      "source": [
        "* Into token string with `.split`:"
      ]
    },
    {
      "metadata": {
        "colab": {
          "height": 34
        },
        "executionInfo": {
          "elapsed": 3,
          "status": "ok",
          "timestamp": 1736779460116,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "h6rnCNYMGv3w",
        "outputId": "d5b55516-c223-4d90-c1de-238cb7cbb291"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.split('Derinkuyu is an underground city.')"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['Der', 'ink', 'uyu', ' is', ' an', ' underground', ' city', '.']"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "yJwtqK9nS9wR"
      },
      "cell_type": "markdown",
      "source": [
        "One thing to notice is that the whitespace ` ` are part of the tokens. For example, this means that for the model, ` hello` and `hello` map to 2 different token ids."
      ]
    },
    {
      "metadata": {
        "colab": {
          "height": 52
        },
        "executionInfo": {
          "elapsed": 120,
          "status": "ok",
          "timestamp": 1736782671124,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "gqkpqsLtS819",
        "outputId": "7ce46cf0-1820-4501-daba-3dc599ad41b1"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.encode(' hello');\n",
        "tokenizer.encode('hello');"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[25612]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "[17534]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "uYxofYT6Tf98"
      },
      "cell_type": "markdown",
      "source": [
        "If doing next word prediction, it's important to not add a trailing space as it would make the out of distribution."
      ]
    },
    {
      "metadata": {
        "executionInfo": {
          "elapsed": 52,
          "status": "ok",
          "timestamp": 1736782686346,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "nagD29tRTvrw",
        "outputId": "64ce3bbe-a7ca-41a1-dca9-32b9b4aa746e"
      },
      "cell_type": "code",
      "source": [
        "# When encoding this sentence, the last token will be an empty whitespace,\n",
        "# which is unusual for the model.\n",
        "tokenizer.split('The capital of France is ')"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "['The', ' capital', ' of', ' France', ' is', ' ']"
            ]
          },
          "execution_count": 65,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "_nGwsFMvEEsk"
      },
      "cell_type": "markdown",
      "source": [
        "### Decoding\n",
        "\n",
        "Tokens can be decoded with `.decode`. You can decode a single id or an entire sentence."
      ]
    },
    {
      "metadata": {
        "colab": {
          "height": 35
        },
        "executionInfo": {
          "elapsed": 54,
          "status": "ok",
          "timestamp": 1736782692607,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "B7oBsuMYEEKI",
        "outputId": "3afb206a-5474-4e3c-b09a-8220e63f0f5f"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.decode([8636, 979, 78904, 603, 671, 30073, 3413, 235265])"
      ],
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'Derinkuyu is an underground city.'"
            ]
          },
          "execution_count": 66,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "colab": {
          "height": 35
        },
        "executionInfo": {
          "elapsed": 4,
          "status": "ok",
          "timestamp": 1736782747529,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "aD6pPn3AUN_r",
        "outputId": "93c5a200-329e-4cf1-e60e-2b47943b6688"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.decode(4567)"
      ],
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'Med'"
            ]
          },
          "execution_count": 68,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "MbsAWP_GD1us"
      },
      "cell_type": "markdown",
      "source": [
        "## Controls tokens\n",
        "\n",
        "Some tokens have special meaning. Forgeting about those may affect the model quality significantly.\n",
        "\n",
        "Special token ids can be accessed through `tokenizer.special_tokens` attribute."
      ]
    },
    {
      "metadata": {
        "id": "Kp6Z83TbNvpX"
      },
      "cell_type": "markdown",
      "source": [
        "### `\u003cbos\u003e` / `\u003ceos\u003e`\n",
        "\n",
        "In Gemma models, the begin of sentence token (`\u003cbos\u003e`) should appear only once at the begining of the input. You can add it either explicitly or with `add_bos=True`:"
      ]
    },
    {
      "metadata": {
        "executionInfo": {
          "elapsed": 3,
          "status": "ok",
          "timestamp": 1736782020194,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "SY1vtGw-RfUl",
        "outputId": "6586a534-56f7-45a8-c09a-93554bb4ed43"
      },
      "cell_type": "code",
      "source": [
        "token_ids = tokenizer.encode('Hello world!')\n",
        "token_ids = [tokenizer.special_tokens.BOS] + token_ids\n",
        "token_ids"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[\u003c_Gemma2SpecialTokens.BOS: 2\u003e, 4521, 2134, 235341]"
            ]
          },
          "execution_count": 55,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "executionInfo": {
          "elapsed": 551,
          "status": "ok",
          "timestamp": 1736782045743,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "rHXqhwO3M-27",
        "outputId": "bc83debe-9ff8-4695-ccfd-dcb560cfd867"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.encode('Hello world!', add_bos=True)"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[\u003c_Gemma2SpecialTokens.BOS: 2\u003e, 4521, 2134, 235341]"
            ]
          },
          "execution_count": 59,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "DEFeNl80RvLW"
      },
      "cell_type": "markdown",
      "source": [
        "Similarly, the model can output a `\u003ceos\u003e` token to indicate the prediction is complete.\n",
        "\n",
        "When fine-tuning Gemma, you can train the model to predict `\u003ceos\u003e` tokens."
      ]
    },
    {
      "metadata": {
        "executionInfo": {
          "elapsed": 2,
          "status": "ok",
          "timestamp": 1736782150952,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "2Ceam5iLSHDT",
        "outputId": "846d38ef-4d76-4fa1-a29a-79f0bbec4ac1"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.encode('Hello world!', add_eos=True)"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[4521, 2134, 235341, \u003c_Gemma2SpecialTokens.EOS: 1\u003e]"
            ]
          },
          "execution_count": 60,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "gtWLkZxVM_iW"
      },
      "cell_type": "markdown",
      "source": [
        "### `\u003cstart_of_turn\u003e` / `\u003cend_of_turn\u003e`\n",
        "\n",
        "When using the instruction-tuned version of Gemma, the `\u003cstart_of_turn\u003e` / `\u003cend_of_turn\u003e` tokens allow to specify who from the user or the model is talking.\n",
        "\n",
        "The `\u003cstart_of_turn\u003e` should be followed by either:\n",
        "\n",
        "* `user`\n",
        "* `model`\n",
        "\n",
        "Example of dialogue with user and model:"
      ]
    },
    {
      "metadata": {
        "id": "BE31hxTiSNdO"
      },
      "cell_type": "code",
      "source": [
        "token_ids = tokenizer.encode(\"\"\"\u003cstart_of_turn\u003euser\n",
        "Knock knock.\u003cend_of_turn\u003e\n",
        "\u003cstart_of_turn\u003emodel\n",
        "Who's there ?\u003cend_of_turn\u003e\n",
        "\u003cstart_of_turn\u003euser\n",
        "Gemma.\u003cend_of_turn\u003e\n",
        "\u003cstart_of_turn\u003emodel\n",
        "Gemma who?\u003cend_of_turn\u003e\"\"\")"
      ],
      "outputs": [],
      "execution_count": null
    },
    {
      "metadata": {
        "colab": {
          "height": 35
        },
        "executionInfo": {
          "elapsed": 54,
          "status": "ok",
          "timestamp": 1736783201911,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "G7BwytSSV-o1",
        "outputId": "31b95f06-e978-4ad1-8c18-3948ba6d7d38"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.decode(token_ids[0])"
      ],
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'\u003cstart_of_turn\u003e'"
            ]
          },
          "execution_count": 77,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "_yp75iY_1Grm"
      },
      "cell_type": "markdown",
      "source": [
        "### `\u003cstart_of_image\u003e`\n",
        "\n",
        "In Gemma 3, to indicate the position of an image in the text, the prompt should contain the special `\u003cstart_of_image\u003e` token. Internally, Gemma model will automatically expand the token to insert the soft images tokens.\n",
        "\n",
        "(Note: There's also a `\u003cend_of_image\u003e` token, but is handled internally by the model)"
      ]
    },
    {
      "metadata": {
        "id": "6M63itbXNN_8"
      },
      "cell_type": "markdown",
      "source": [
        "### Custom tokens"
      ]
    },
    {
      "metadata": {
        "id": "a0reM2p-3ceC"
      },
      "cell_type": "markdown",
      "source": [
        "In all Gemma versions, a few tokens (`99`) are unused. This allow custom applications to define and fine-tune their own custom tokens for their application. Those tokens are available through `tokenizer.special_tokens.CUSTOM + xx`, with `xx` being a number between `0` and `98`\n",
        "\n",
        "\u003c!-- TODO(epot): Add option to customize the special tokens --\u003e"
      ]
    },
    {
      "metadata": {
        "colab": {
          "height": 35
        },
        "executionInfo": {
          "elapsed": 3,
          "status": "ok",
          "timestamp": 1736941433809,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "_Yx0474YXLij",
        "outputId": "dd7e97fc-72fe-4c93-8cb6-be45ae2ef929"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.decode(tokenizer.special_tokens.CUSTOM + 17)"
      ],
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'\u003cunused17\u003e'"
            ]
          },
          "execution_count": 28,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "07p_1TQAxzPS"
      },
      "cell_type": "markdown",
      "source": [
        "You can customize what the custom tokens correspond to when constructing the tokenizer."
      ]
    },
    {
      "metadata": {
        "executionInfo": {
          "elapsed": 324,
          "status": "ok",
          "timestamp": 1736944538160,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "USY1STCyxhsc",
        "outputId": "d097c84c-60d9-48dd-bff8-551b73110a77"
      },
      "cell_type": "code",
      "source": [
        "tokenizer = gm.text.Gemma3Tokenizer(\n",
        "    custom_tokens={\n",
        "        0: '\u003cmy_custom_tag\u003e',\n",
        "        17: '\u003cmy_other_tag\u003e',\n",
        "    },\n",
        ")\n",
        "\n",
        "tokenizer.encode('\u003cmy_other_tag\u003e')"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[24]"
            ]
          },
          "execution_count": 35,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "id": "SkOFvUJi9qPj"
      },
      "cell_type": "markdown",
      "source": [
        "The custom tokens string are encoded to the matching token id."
      ]
    },
    {
      "metadata": {
        "executionInfo": {
          "elapsed": 52,
          "status": "ok",
          "timestamp": 1736944536651,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "cDSUy1BC9auP",
        "outputId": "185e8c50-ef21-4399-a76e-041744dca151"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.special_tokens.CUSTOM + 17"
      ],
      "outputs": [
        {
          "data": {
            "text/plain": [
              "24"
            ]
          },
          "execution_count": 34,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    },
    {
      "metadata": {
        "colab": {
          "height": 35
        },
        "executionInfo": {
          "elapsed": 60,
          "status": "ok",
          "timestamp": 1736944668612,
          "user": {
            "displayName": "",
            "userId": ""
          },
          "user_tz": -60
        },
        "id": "Poo88ka9-EsM",
        "outputId": "1cd33db9-634c-4c60-97b5-1b28fca55fa7"
      },
      "cell_type": "code",
      "source": [
        "tokenizer.decode(tokenizer.special_tokens.CUSTOM + 17)"
      ],
      "outputs": [
        {
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'\u003cmy_other_tag\u003e'"
            ]
          },
          "execution_count": 36,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "execution_count": null
    }
  ],
  "metadata": {
    "colab": {
      "last_runtime": {},
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
